Abstract

Can data from mobile phones be used to observe economic shocks and their consequences at 
multiple scales? Here we present novel methods to detect mass layoffs, identify individuals affected 
by them, and predict changes in aggregate unemployment rates using call detail records (CDRs) 
from mobile phones. Using the closure of a large manufacturing plant as a case study, we first 
describe a structural break model to correctly detect the date of a mass layoff and estimate its size. 
We then use a Bayesian classification model to identify affected individuals by observing changes in 
calling behavior following the plant’s closure. For these affected individuals, we observe significant 
declines in social behavior and mobility following job loss. Using the features identified at the micro 
level, we show that the same changes in these calling behaviors, aggregated at the regional level, can 
improve forecasts of macro unemployment rates. These methods and results highlight the promise of
new data resources to measure micro economic behavior and improve estimates of critical economic 
indicators. 

Keywords: unemployment — computational social science — social networks — mobility — complex systems 

Economic statistics are critical for decision-making by both government and private
institutions. Despite their great importance, current measurements draw on limited sources
of information, losing precision with potentially dire consequences. The beginning of the
Great Recession offers a powerful case study: the initial BEA estimate of the contraction
of GDP in the fourth quarter of 2008 was an annual rate of 3.8%. The American Recovery
and Reinvestment Act (stimulus) was passed based on this understanding in February
2009. Less than two weeks after the plan was passed, that 3.8% figure was revised to 6.2%,
and subsequent revisions peg the number at a jaw-dropping 8.9%, more severe than the
worst quarter during the Great Depression. The government statistics were wrong and may
have hampered an effective intervention. As participation rates in unemployment surveys
drop, serious questions have been raised as to the declining accuracy and increased bias in
unemployment numbers [1].

In this paper we offer a methodology to infer changes in the macro economy in near 
real time, at arbitrarily fine spatial granularity, using data already passively collected from 
mobile phones. We demonstrate the reliability of these techniques by studying data from 
two European countries. In the first, we show it is possible to observe mass layoffs and 
identify the users affected by them in mobile phone records. We then track the mobility 
and social interactions of these affected workers and observe that job loss has a systematic 
dampening effect on their social and mobility behavior. Having observed an effect in the 
micro data, we apply our findings to the macro scale by creating corresponding features 
to predict unemployment rates at the province scale. In the second country, where the 
macro-level data is available, we show that changes in mobility and social behavior predict 
unemployment rates ahead of official reports and more accurately than traditional forecasts. 
These results demonstrate the promise of using new data to bridge the gap between micro 
and macro economic behaviors and track important economic indicators. Figure 1 shows a
schematic of our methodology.

FIG. 1. A schematic view of the relationship between job loss and call dynamics. We use the calling
behavior of individuals to infer job loss and measure its effects. We then measure these variables and 
include them in predictions of unemployment at the macro scale, significantly improving forecasts. 

MEASURING THE ECONOMY 

Contemporary macroeconomic statistics are based on a paradigm of data collection and 
analysis begun in the 1930s M- Most economic statistics are constructed from either 
survey data or administrative records. For example, the US unemployment rate is calculated 
based on the monthly Current Population Survey of roughly 60,000 households, and the 
Bureau of Labor Statistics manually collects 80,000 prices a month to calculate inflation. 
Both administrative databases and surveys can be slow to collect, costly to administer, and 
fail to capture significant segments of the economy. These surveys can quickly face sample 
size limitations at fine geographies and require strong assumptions about the consistency of 
responses over time. Statistics inferred from survey methods have considerable uncertainty 
and are routinely revised in months following their release as other data is slowly collected 
[HIM. Moreover, changes in survey methodology can result in adjustments of reported 
rates of up to 1-2 percentage points [7]. 

The current survey-based paradigm also makes it challenging to study the effect of economic
shocks on networks or behavior without reliable self-reports. This has hampered
scientific research. For example, many studies have documented the severe negative
consequences of job loss in the form of difficulties in retirement [j£], persistently lower wages
following re-employment including even negative effects on children’s outcomes PHD],
increased risk of death and illness mm, higher likelihood of divorce ra. and, unsurprisingly,
negative impacts on happiness and emotional well-being ra. Due to the cost of
obtaining the necessary data, however, social scientists have been unable to directly observe
the large negative impact of a layoff on the frequency and stability of an individual’s social
interactions or mobility.


PREDICTING THE PRESENT 

These shortcomings raise the question as to whether existing methods could be supplemented
by large-scale behavioral trace data. There have been substantial efforts to discern
important population events from such data, captured by the pithy phrase “predicting
the present” mm- Prior work has linked news stories with stock prices [EMU and
used web search or social media data to forecast labor markets [22-26], consumer behavior
[27, 28], automobile demand, and vacation destinations mm. Social media,
search, and surfing behavior have also been shown to signal emerging public health problems
PUHTF], although for a cautionary tale see [38]. Recent efforts have even been made towards
leveraging Twitter to detect and track earthquakes in real time, faster than
seismographic sensors pTtTl ITT]. While there are nuances to the analytic approaches taken,
the dominant approach has been to extract features from some large scale observational
data and to evaluate the predictive (correlation) value of those features with some set of
measured aggregate outcomes (such as disease prevalence). Here we offer a twist on this
methodology: we identify features from observational data and cross-validate them
across individual and aggregate levels.

All of the applications of predicting the future are predicated in part on the presence of 
distinct signatures associated with the systemic event under examination. The key analytic 
challenge is to identify signals that (1) are observable or distinctive enough to rise above 
the background din, (2) are unique or generate few false positives, (3) contain information 
beyond well-understood patterns such as calendar-based fluctuations, and (4) are robust to 
manipulation. Mobile phone data, our focus here, are particularly promising for early
detection of systemic events as they combine spatial and temporal comprehensiveness, naturally
incorporate mobility and social network information, and are too costly to intentionally 
manipulate. 

Data from mobile phones has already proven extremely beneficial to understanding the
everyday dynamics of social networks [T21 RTS1 and mobility patterns of millions. With
a fundamental understanding of regular behavior, it becomes possible to explore deviations
caused by collective events such as emergencies 0, natural disasters [58, 59], and cultural
occasions pa eu. Less has been done to link these data to economic behavior. In this
paper we offer a methodology to robustly measure employment shocks at
extremely high spatial and temporal resolutions and improve critical economic indicators.


DATA 

We focus our analysis at three levels: the individual, the community, and the provincial 
levels. We begin with unemployment at the community (town) level, where we examine the 
behavioral traces of a large-scale layoff event. At the community and individual levels, we 
analyze call record data from a service provider with an approximately 15% market share in 
an undisclosed European country. The community-level data set spans a 15 month period 
between 2006 and 2007, with the exception of a 6 week gap due to data extraction failures. 
At the province level, we examine call detail records from a service provider from another 
European country, with an approximately 20% market share and data running for 36 months 
from 2006 to 2009. Records in each data set include an anonymous id for caller and callee, 
the location of the tower through which the call was made, and the time the call occurred. 
In both cases we examine the universe of call records made over the provider’s network (see 
SI for more details). 


OBSERVING UNEMPLOYMENT AT THE COMMUNITY LEVEL 

We study the closure of an auto-parts manufacturing plant (the plant) that occurred in 
December, 2006. As a result of the plant closure, roughly 1,100 workers lost their jobs in 
a small community (the town) of 15,000. Our approach builds on recent papers [H2l fo4l [52] 
that use call record data to measure social and mobility patterns.

There are three mobile phone towers within close proximity of the town and the plant.
The first is directly within the town, the second is roughly 3 km from the first and is
geographically closest to the manufacturing plant, while the third is roughly 6.5 km from the
first two on a nearby hilltop. In total, these three towers serve an area of roughly 220 km$^2$
of which only 6 km$^2$ is densely populated. There are no other towns in the region covered by
these towers. Because the exact tower through which a call is routed may depend on factors 
beyond simple geographic proximity (e.g. obstructions due to buildings), we consider any 
call made from these three towers as having originated from the town or plant. 

We model the pre-closure daily population of the town as made up of a fraction of
individuals $\gamma$ who will no longer make calls near the plant following its closure and the
complementary set of individuals who will remain ($1 - \gamma$). As a result of the layoff, the total
number of calls made near the plant will drop by an amount corresponding to the daily calls 
of workers who are now absent. This amounts to a structural break model that we can use to 
estimate the prior probability that a user observed near the plant was laid off, the expected 
drop in calls that would identify them as an affected worker, and the time of the closure (see 
SI for full description of this model). We suspect that some workers laid off from the plant 
are residents of the town and thus they will still appear to make regular phone calls from 
the three towers and will not be counted as affected. Even with this limitation, we find a 
large change in behavior. 

To verify the date of the plant closing, we sum the number of daily calls from 1955 regular 
users (i.e. those who make at least one call from the town each month prior to the layoff) 
connecting through towers geographically proximate to the affected plant. The estimator 
selects a break date, $t_{\mathrm{layoff}}$, and pre- and post-break daily volume predictions to minimize
the squared deviation of the model from the data. The estimated values are overlaid on
daily call volume and the actual closure date in Figure 2A. As is evident in the figure,
the timing of the plant closure (as reported in newspapers and court filings) can be recovered
statistically using this procedure: the optimized predictions display a sharp and significant
reduction at this date. As a separate check to ensure this method is correctly identifying
the break date, we estimate the same model for calls from each individual user $i$ and find
that the distribution of these dates, $t^{i}_{\mathrm{layoff}}$, is peaked around the actual layoff date
(see Figure 1 in SI).
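
A minimal sketch of the break-date search described above, assuming a single break in the mean of the daily call series and a piecewise-constant fit before and after it. The function name `estimate_break` and the synthetic series are illustrative assumptions; the full model is specified in the SI and may differ.

```python
import numpy as np

def estimate_break(daily_calls):
    """Least-squares search for a single break in the mean of a daily series.

    Returns (break_index, pre_mean, post_mean), where break_index is the
    first day of the post-break regime.
    """
    y = np.asarray(daily_calls, dtype=float)
    best = None
    for t in range(1, len(y)):  # keep at least one day on each side
        pre, post = y[:t], y[t:]
        # Sum of squared deviations from the two regime means
        sse = ((pre - pre.mean()) ** 2).sum() + ((post - post.mean()) ** 2).sum()
        if best is None or sse < best[0]:
            best = (sse, t, pre.mean(), post.mean())
    _, t_hat, mu_pre, mu_post = best
    return t_hat, mu_pre, mu_post

# Synthetic example with made-up numbers: 120 days before, 80 days after a drop.
rng = np.random.default_rng(0)
series = np.concatenate([rng.normal(1000, 30, 120), rng.normal(800, 30, 80)])
t_hat, mu_pre, mu_post = estimate_break(series)
```

The exhaustive scan is quadratic in the number of days, which is acceptable for a series of a few hundred days such as the 15-month window studied here.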

FIG. 2. Identifying the layoff date. A) Total aggregate call volume (black line) from users who make
regular calls from towers near the plant is plotted against our model (blue). The model predicts a
sudden drop in aggregate call volume and correctly identifies the date of the plant closure as the
one reported in newspapers and court records. B) Each of the top 300 users likely to have been laid
off is represented by a row where we fill in a day as colored if a call was made near the plant on that
day. White space marks the absence of calls. Rows are sorted by the assigned probability of that
user being laid off according to our Bayesian model. Users with high probabilities cease making
calls near the plant directly following the layoff. C) We see a sharp, sustained drop in the fraction
of calls made near the plant by users assigned to the top decile in probability of being unemployed
(red) while no effect is seen for the control group users believed to be unaffected (blue). Moreover,
we see that laid off individuals have an additional drop off for a two week period roughly 125 days
prior to the plant closure. This time period was confirmed to be a coordinated vacation for workers,
providing further evidence we are correctly identifying laid off workers.


OBSERVING UNEMPLOYMENT AT THE INDIVIDUAL LEVEL 

To identify users directly affected by the layoff, we calculate Bayesian probability weights 
based on changes in mobile phone activity. For each user, we calculate the conditional 
probability that a user is a non-resident worker laid off as part of the plant closure based 
on their observed pattern of calls. To do this, we compute the difference in the fraction of
days on which a user made a call near the plant in the 50 days prior to the week of the layoff
and in the 50 days after it. We denote this difference as $\Delta q = q_{\mathrm{pre}} - q_{\mathrm{post}}$.
We consider each user’s observed difference a single realization of a random variable, $\Delta q$.
Under the hypothesis that there is no change in behavior, the random variable $\Delta q$ is distributed
$N\left(0, \sqrt{\frac{q_{\mathrm{pre}}(1 - q_{\mathrm{pre}})}{50} + \frac{q_{\mathrm{post}}(1 - q_{\mathrm{post}})}{50}}\right)$.
Under the alternative hypothesis that the individual’s behavior changes pre- and post-layoff, the random
variable $\Delta q$ is distributed
$N\left(d, \sqrt{\frac{q_{\mathrm{pre}}(1 - q_{\mathrm{pre}})}{50} + \frac{q_{\mathrm{post}}(1 - q_{\mathrm{post}})}{50}}\right)$,
where $d$ is the mean reduction
in calls from the plant for non-resident plant workers laid off when the plant was closed. We
assign user $i$ the following probability of having been laid off given his or her calling pattern:
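
The expression can be written with Bayes' rule. The form below is a reconstruction from the definitions in this section, not a quotation, and the symbols $f_0$, $f_1$ and $\sigma_i$ are notation introduced here: $f_0$ is the normal density of the observed difference under the no-change hypothesis (mean 0), $f_1$ is the density under the alternative (mean $d$), and $\gamma$ is the prior probability that a user is a non-resident plant worker.

```latex
P(\text{laid off} \mid \Delta q_i)
  = \frac{\gamma \, f_1(\Delta q_i)}
         {\gamma \, f_1(\Delta q_i) + (1 - \gamma) \, f_0(\Delta q_i)},
\qquad
f_0 = N(0, \sigma_i), \quad f_1 = N(d, \sigma_i),
\quad
\sigma_i = \sqrt{\frac{q_{\mathrm{pre}}(1 - q_{\mathrm{pre}})}{50}
               + \frac{q_{\mathrm{post}}(1 - q_{\mathrm{post}})}{50}}
```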

Calculating the probabilities requires two parameters: $\gamma$, our prior that an individual is
a non-resident worker at the affected plant, and $d$, the threshold we use for the alternative
hypothesis. The values of $\gamma = 5.8\%$ and $d = 0.29$ are determined based on values fit from
the model in the previous section.
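
As a sketch of how such probability weights could be computed, the snippet below applies Bayes' rule in log-odds form with the reported prior (5.8%) and threshold (0.29). The function name `layoff_probability`, the 50-day window argument, and the variance floor are assumptions made for illustration, not details taken from the SI.

```python
import math

def layoff_probability(q_pre, q_post, gamma=0.058, d=0.29, n_days=50):
    """Posterior probability that a user was laid off.

    q_pre, q_post: fraction of days with a call near the plant before and
    after the layoff week. gamma: prior probability. d: mean shift under the
    alternative hypothesis.
    """
    dq = q_pre - q_post
    var = (q_pre * (1 - q_pre) + q_post * (1 - q_post)) / n_days
    var = max(var, 1e-4)  # variance floor (assumption) for users with q in {0, 1}
    # Log-likelihood ratio of N(d, sd) against N(0, sd) evaluated at dq
    log_lr = d * (2 * dq - d) / (2 * var)
    log_odds = math.log(gamma / (1 - gamma)) + log_lr
    log_odds = max(min(log_odds, 700.0), -700.0)  # avoid overflow in exp
    return 1.0 / (1.0 + math.exp(-log_odds))

# Hypothetical users: one who stops calling near the plant, one who does not.
p_stopped = layoff_probability(q_pre=0.8, q_post=0.1)
p_unchanged = layoff_probability(q_pre=0.8, q_post=0.8)
```

Working in log-odds avoids dividing two vanishingly small densities when a user's behavior is far from both hypotheses.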

Validating the Layoff 

On an individual level, Figure 2B shows days on which each user makes a call near the
plant ranked from highest to lowest probability weight (only the top 300 users are shown, 
see Figure 2 in SI for more users). Users highly suspected of being laid off demonstrate a 
sharp decline in the number of days they make calls near the plant following the reported 
closure date. While we do not have ground-truth evidence that any of these mobile phone 
users was laid off, we find more support for our hypothesis by examining a two week period 
roughly 125 days prior to the plant closure. Figure 2C shows a sharp drop in the fraction
of calls coming from this plant for users identified as laid off post closure. This period 
corresponds to a confirmed coordinated holiday for plant workers and statistical analysis 
confirms a highly significant break for individuals classified as plant workers in the layoff for 
this period. Given that we did not use call data from this period in our estimation of the 
Bayesian model, this provides strong evidence that we are correctly identifying the portion 
of users who were laid off by this closure. In aggregate, we assign 143 users probability 
weights between 50% and 100%. This represents 13% of the pre-closure plant workforce and 
compares closely with the roughly 15% national market share of the service provider. 


ASSESSING THE EFFECT OF UNEMPLOYMENT AT THE INDIVIDUAL LEVEL 

We now turn to analyzing behavioral changes associated with job loss at the individual 
level. We first consider six quantities related to monthly social behavior: A) total calls,
B) number of incoming calls, C) number of outgoing calls, D) calls made to individuals
physically located in the town of the plant (as a proxy for contacts made at work), E)
number of unique contacts, and F) the fraction of contacts called in the previous month
that were not called in the current month, referred to as churn. In addition to measuring
social behavior, we also quantify changes in three metrics related to mobility: G) number
of unique locations visited, H) radius of gyration, and I) average distance from most visited
tower (see SI for detailed definitions of these variables). To guard against outliers such as
long trips for vacation or difficulty identifying important locations due to noise, we only
consider months for users where more than 5 calls were made and locations where a user
recorded more than three calls.

[Figure 3 panel titles: “Monthly changes in behavior following a mass layoff”; “Summary: Effects of mass layoffs on social behavior and mobility”]

FIG. 3. Changes in social networks and mobility following layoffs. We quantify the effect of
mass layoffs relative to two control groups: users making regular calls from the town who were not
identified as laid off and a random sample of users from the rest of the country. We report monthly
point estimates for six social and three mobility behaviors: A) total calls, B) number of incoming
calls, C) number of outgoing calls, D) fraction of calls to individuals in the town at the time of the
call, E) number of unique contacts, F) the fraction of individuals called in the previous month
who were not called in the current month (churn), G) number of unique towers visited, H) radius
of gyration, and I) average distance from most visited tower. Pooling months pre- and post-layoff yields
statistically significant changes in monthly social network and mobility metrics following a mass
layoff. J) Regression coefficients for each of our 9 dependent variables along with the 66%
and 95% confidence intervals.
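
Two of the metrics above, churn and radius of gyration, can be sketched as follows. The definitions used here are common ones and are assumptions: the SI gives the authors' exact definitions. Churn is the share of last month's contacts not called this month, and radius of gyration is the root mean squared distance of call locations from their centroid. Coordinates are assumed to be planar and in kilometres, and the function names are hypothetical.

```python
import math

def radius_of_gyration(points):
    """points: list of (x_km, y_km) tower coordinates, one entry per call."""
    n = len(points)
    cx = sum(x for x, _ in points) / n
    cy = sum(y for _, y in points) / n
    # Root mean squared distance from the centre of mass of the calls
    return math.sqrt(sum((x - cx) ** 2 + (y - cy) ** 2 for x, y in points) / n)

def churn(prev_contacts, curr_contacts):
    """Fraction of contacts called last month that were not called this month."""
    prev = set(prev_contacts)
    if not prev:
        return float("nan")  # undefined when there were no contacts last month
    return len(prev - set(curr_contacts)) / len(prev)

# Hypothetical user: calls split between two towers 2 km apart,
# and two of four earlier contacts dropped this month.
rg = radius_of_gyration([(0.0, 0.0), (2.0, 0.0)])
ch = churn({"a", "b", "c", "d"}, {"a", "b", "e"})
```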

We measure changes in these quantities using all calls made by each user (not just those
near the plant) relative to months prior to the plant closure, weighting measurements by
the probability an individual is laid off and relative to two reference groups: individuals
who make regular calls from the town but were not believed to be laid off (mathematically,
we weight this group using the inverse weights from our Bayesian classifier) and a random
sample of 10,000 mobile phone users throughout the country (all users in this sample are 
weighted equally). 

Figure 3A-I shows monthly point estimates of the average difference between relevant
characteristics of users believed to be laid off compared to control groups. This figure shows
an abrupt change in variables in the month directly following the plant closure. Despite
this abrupt change, data at the individual level are sufficiently noisy that the monthly
point estimates are not significantly different from 0 in every month. However, when data
from months pre- and post-layoff are pooled, these differences are robust and statistically
significant. The right panel of Figure 3 and Table I in the SI show the results of OLS
regressions comparing the pre-closure and post-closure periods for laid-off users relative to
the two reference groups (see SI for detailed model specification as well as confidence intervals
for percent changes pre- and post-layoff for each variable). The abrupt and sustained change
in monthly behavior of individuals with a high probability of being laid off is compelling
evidence that mobile phone data can be used to detect mass layoffs.
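
One simple way to read the pooled comparison is as a weighted difference-in-differences. The sketch below is not the OLS specification given in the SI. It is an illustration under the assumption that treated users are weighted by their layoff probability and the town reference group by one minus that probability; the function name and all numbers are made up.

```python
import numpy as np

def weighted_did(y, post, weight):
    """Weighted difference-in-differences on user-month observations.

    y: outcome (e.g. total calls); post: 1 for months after the closure,
    0 before; weight: probability that the user was laid off.
    """
    y, post, w = map(np.asarray, (y, post, weight))

    def wmean(values, weights):
        return np.sum(values * weights) / np.sum(weights)

    before, after = post == 0, post == 1
    treated = wmean(y[after], w[after]) - wmean(y[before], w[before])
    control = wmean(y[after], 1 - w[after]) - wmean(y[before], 1 - w[before])
    return treated - control

# Hypothetical user-months: two likely laid-off users, two likely unaffected.
calls = [100.0, 90.0, 95.0, 105.0, 50.0, 45.0, 96.0, 100.0]
after_closure = [0, 0, 0, 0, 1, 1, 1, 1]
p_laid_off = [0.9, 0.8, 0.1, 0.05, 0.9, 0.8, 0.1, 0.05]
effect = weighted_did(calls, after_closure, p_laid_off)
```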

We find that the total number of calls made by laid off individuals drops 51% and 41%
following the layoff when compared to non-laid off residents and random users, respectively.
Moreover, this drop is asymmetric. The number of outgoing calls decreases by 54%
compared to a 41% drop in incoming calls (using non-laid off residents as a baseline).
Similarly, the number of unique contacts called in months following the closure is significantly
lower for users likely to have been laid off. The fraction of calls made by a user to someone
physically located in the town drops 4.7 percentage points for laid off users compared
to residents of the town who were not laid off. Finally, we find that the month-to-month
churn of a laid off person’s social network increases roughly 3.6 percentage points relative to
control groups. These results suggest that a user’s social interactions see significant decline
and that their networks become less stable following job loss. This loss of social connections
may amplify the negative consequences associated with job loss observed in other studies.

For our mobility metrics, we find that the number of unique towers visited by laid-off
individuals decreases 17% and 20% relative to the random sample and town sample, respectively.
Radius of gyration falls by 20% and 22% while the average distance a user is found from
the most visited tower also decreases by 26% relative to reference groups. These
changes reflect a general decline in the mobility of individuals following job loss, another
potential factor in long term consequences.


OBSERVING UNEMPLOYMENT AT THE PROVINCE LEVEL 

The relationship between mass layoff events and these features of CDRs suggests a potential
for predicting important, large-scale unemployment trends based on the population’s call
information. Provided the effects of general layoffs and unemployment are similar enough 
to those due to mass layoffs, it may be possible to use observed behavioral changes as 
additional predictors of general levels of unemployment. To perform this analysis, we use 
another CDR data set covering approximately 10 million subscribers in a different European 
country, which has been studied in prior work HH S51 [52H5H |57]. This country experienced 
enormous macroeconomic disruptions, the magnitude of which varied regionally during the 
period in which the data are available. We supplement the CDR data set with quarterly, 
province-level unemployment rates from the EU Labor Force Survey, a large sample survey 
providing data on regional economic conditions within the EU (see SI for additional details). 

We compute seven aggregated measures identified in the previous section: call volume, 
incoming calls, outgoing calls, number of contacts, churn, number of towers, and radius of 
gyration. Distance from home was omitted due to strong correlation with radius of gyration 
while calls to the town was omitted because it is not applicable in a different country. For 
reasons of computational cost, we first take a random sample of 3000 mobile phone users 
for each province. The sample size was determined to ensure the estimated feature values
are stable (see SI Figure 6 for details). We then compute the seven features aggregated
per month for each individual user. The $k$-th feature value of user $i$ at month $t$ is denoted
as $y_{i,t,k}$ and we compute month-over-month changes in this quantity, denoted $y'_{i,t,k}$. A
normalized feature value for a province $s$ is computed by averaging over all users in the selected
province, $\bar{y}_{s,t,k} = \frac{1}{N_s} \sum_{i \in s} y'_{i,t,k}$, where $N_s$ is the number of sampled
users in province $s$. In addition, we use percentiles of the bootstrap distribution to
compute the 95% confidence interval for the estimated feature value.
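
The province-level aggregation and percentile bootstrap can be sketched as below. This is an illustration under stated assumptions: users are resampled with replacement, 1000 bootstrap replicates are drawn, and the function name `province_feature` and the synthetic per-user changes are hypothetical.

```python
import numpy as np

def province_feature(changes, n_boot=1000, seed=0):
    """Mean month-over-month change for one province, feature and month,
    with a 95% percentile-bootstrap confidence interval.

    changes: per-user changes for the sampled users of the province.
    """
    x = np.asarray(changes, dtype=float)
    rng = np.random.default_rng(seed)
    # Each row of idx is one bootstrap resample of users, drawn with replacement
    idx = rng.integers(0, len(x), size=(n_boot, len(x)))
    boot_means = x[idx].mean(axis=1)
    lo, hi = np.percentile(boot_means, [2.5, 97.5])
    return x.mean(), lo, hi

# Synthetic changes for 3000 sampled users (made-up distribution).
sample_changes = np.random.default_rng(1).normal(-0.05, 0.2, 3000)
estimate, ci_low, ci_high = province_feature(sample_changes)
```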

After aggregating these metrics to the province level, we assess their power to improve 
predictions of unemployment rates. Note that we do not attempt to identify mass layoffs 
in this country. Instead, we look for behavioral changes that may have been caused by
layoffs and see if these changes are predictive of general unemployment statistics. First, we
correlate each aggregate measure with regional unemployment separately, finding significant
correlations in the same direction as was found for individuals (see Table II in the SI). We
also find strong correlations between calling behavior variables, suggesting that principal
component analysis (PCA) can reasonably be used to construct independent variables that 
capture changes in calling behavior while guarding against co-linearity. The first principal 
component, with an eigenvalue of 4.10, captures 59% of the variance in our data and is 
the only eigenvalue that satisfies the Kaiser criterion. The loadings in this component are 
strongest for social variables. Additional details on the results of PCA can be found in the SI 
Tables III and IV. Finally, we compute the scores for the first component for each observation 
and build a series of models that predict quarterly unemployment rates in provinces with 
and without the inclusion of this representative mobile phone variable. 
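
The reported variance share follows directly from the eigenvalue: with seven standardized features the eigenvalues of the correlation matrix sum to 7, so an eigenvalue of 4.10 accounts for 4.10 / 7 ≈ 0.586, or about 59%. A minimal sketch of this step is given below, assuming PCA is run on the correlation matrix of the seven features; the function name and the synthetic data are illustrative only.

```python
import numpy as np

def first_component_scores(X):
    """PCA on the correlation matrix of the columns of X.

    Returns eigenvalues (descending), explained variance shares, a Kaiser
    criterion mask (eigenvalue > 1), and scores on the first component.
    """
    X = np.asarray(X, dtype=float)
    Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)  # standardize features
    corr = np.corrcoef(Z, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(corr)  # eigh returns ascending order
    order = np.argsort(eigvals)[::-1]
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    explained = eigvals / eigvals.sum()
    kaiser = eigvals > 1.0
    scores = Z @ eigvecs[:, 0]
    return eigvals, explained, kaiser, scores

# Synthetic data: seven features driven by one common factor plus noise.
rng = np.random.default_rng(0)
common = rng.normal(size=200)
features = common[:, None] + 0.3 * rng.normal(size=(200, 7))
eigvals, explained, kaiser, scores = first_component_scores(features)
```

The sign of a principal component is arbitrary, so the sign of a regression coefficient on the scores should be read together with the loadings.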

First, we predict the present by estimating a regression of a given quarter’s unemployment
on calling behavior in that quarter (e.g. using phone data from Q1 to predict unemployment
in Q1). As phone data is available the day a quarter ends, this method can produce
predictions weeks before survey results are tabulated and released. Next, we predict the future in
a more traditional sense by estimating a regression on a quarter’s surveyed unemployment
rate using mobile phone data from the previous quarter as a leading indicator (e.g. phone metrics
from Q1 to predict unemployment rates in Q2). This method can produce predictions
months before surveys are even conducted. See Figure 3 in the SI for a detailed timeline
of data collection, release, and prediction periods. We have eight quarters of unemployment
data for 52 provinces. We make and test our predictions by training our models on
half of the provinces and cross-validate by testing on the other half. The groups are then
switched to generate out-of-sample predictions for all provinces. Prediction results for an
AR1 model that includes a CDR variable are plotted against actual unemployment rates
in Figure 4. We find strong correlations between actual values and predictions of
present unemployment rates ($\rho = 0.95$) as well as unemployment rates one quarter in the
future ($\rho = 0.85$).
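
The split-half procedure can be sketched as follows, assuming an AR1-style linear model fit by ordinary least squares with the lagged unemployment rate and a CDR principal-component score as regressors. The helper names and the synthetic panel are hypothetical, and the coefficients used to generate the data are made up rather than taken from the SI.

```python
import numpy as np

def fit_ols(X, y):
    """OLS coefficients with an intercept in the first position."""
    X1 = np.column_stack([np.ones(len(X)), X])
    beta, *_ = np.linalg.lstsq(X1, y, rcond=None)
    return beta

def predict(beta, X):
    return np.column_stack([np.ones(len(X)), X]) @ beta

def split_half_rmse(X, y, province, seed=0):
    """Train on half of the provinces, predict the other half, then swap."""
    rng = np.random.default_rng(seed)
    provinces = rng.permutation(np.unique(province))
    half_a = np.isin(province, provinces[: len(provinces) // 2])
    preds = np.empty(len(y))
    for train, test in ((half_a, ~half_a), (~half_a, half_a)):
        preds[test] = predict(fit_ols(X[train], y[train]), X[test])
    return np.sqrt(np.mean((preds - y) ** 2))

# Synthetic panel (made-up numbers): 20 provinces, 8 quarters each.
rng = np.random.default_rng(1)
province = np.repeat(np.arange(20), 8)
lagged_rate = rng.normal(10.0, 3.0, size=province.size)
cdr_score = rng.normal(0.0, 1.0, size=province.size)
rate = 0.5 + 0.8 * lagged_rate - 1.0 * cdr_score + rng.normal(0.0, 0.1, size=province.size)

rmse_base = split_half_rmse(lagged_rate[:, None], rate, province)
rmse_cdr = split_half_rmse(np.column_stack([lagged_rate, cdr_score]), rate, province)
improvement = 100.0 * (rmse_base - rmse_cdr) / rmse_base  # percent RMSE reduction
```

Splitting by province rather than by observation keeps all quarters of a province on the same side of the split, so every prediction is out of sample for that province.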

As advocated in [38], it is important to benchmark these types of prediction algorithms
against standard forecasts that use existing data. Previous work has shown that the
performance of most unemployment forecasts is poor and that simple linear models routinely
outperform complicated non-linear approaches [62-65] and the dynamic stochastic general
equilibrium (DSGE) models aimed at simulating complex macro economic interactions [66, 67].
With this in mind, we compare predictions made with and without mobile phone covariates
using three different model specifications: AR1, AR1 with a quadratic term (AR1 Quad), and
AR1 with a lagged national GDP covariate (AR1 GDP). In each of these model
specifications, the coefficient related to the principal component CDR score is highly significant and
negative, as expected given that the loadings weigh heavily on social variables that declined
after a mass layoff (see Tables V and VI in the SI for regression results). Moreover, adding
metrics derived from mobile phone data significantly improves forecast accuracy for each
model and reduces the root mean squared error of unemployment rate predictions by
between 5% and 20% (see insets in Figure 4). As additional checks that we are capturing true
improvements, we use mobile phone data from only the first half of each quarter (before
surveys are even conducted) and still achieve a 3%-10% improvement in forecasts. These
results hold even when variants are run to include quarterly and province level fixed effects
(see Tables VII and VIII in the SI).

In summary, we have shown that features associated with job loss at the individual level 
are similarly correlated with province level changes in unemployment rates in a separate 
country. Moreover, we have demonstrated the ability of massive, passively collected data 
to identify salient features of economic shocks that can be scaled up to measure macro 
economic changes. These methods allow us to predict “present” unemployment rates two 
to eight weeks prior to the release of traditional estimates and predict “future” rates up to 
four months ahead of official reports more accurately than using historical data alone. 


DISCUSSION 

We have presented algorithms capable of identifying employment shocks at the individual,
community, and societal scales from mobile phone data. These findings have great practical
importance, potentially facilitating the identification of macro-economic statistics with
much finer spatial granularity and faster than traditional methods of tracking the economy.
We can not only improve estimates of the current state of the economy and provide predictions
faster than traditional methods, but also predict future states and correct for current
uncertainties. Moreover, with the quantity and richness of these data increasing daily, these
results represent conservative estimates of their potential for predicting economic indicators.

FIG. 4. Predicting unemployment rates using mobile phone data. We demonstrate that aggregating
measurements of mobile phone behaviors associated with unemployment at the individual level 
also predicts unemployment rates at the province level. To make our forecasts, we train various 
models on data from half of the provinces and use these coefficients to predict the other half. 
Panel A compares predictions of present unemployment rates to observed rates and Panel B shows 
predictions of unemployment one quarter ahead using an AR1 model that includes co-variates of 
behaviors measured using mobile phones. Both predictions correlate strongly with actual values 
while changes in rates are more difficult to predict. The insets show the percent improvement 
to the RMSE of predictions when mobile phone co-variates are added to various baseline model 
specifications. In general, the inclusion of mobile phone data reduces forecast errors by 5% to 20%. 


The ability to get this information weeks to months faster than traditional methods is
extremely valuable to policy and decision makers in public and private institutions. Further, it
is likely that CDR data are more robust to external manipulation and less subject to service
provider algorithmic changes than most social media [38]. But, just as important, the micro
nature of these data allows for the development of new empirical approaches to study the
effect of economic shocks on interrelated individuals.

While this study highlights the potential of new data sources to improve forecasts of
critical economic indicators, we do not view these methods as a substitute for survey-based
approaches. Though data quantity is increased by orders of magnitude with the collection
of passively generated data from digital devices, the price of this scale is control. The
researcher no longer has the ability to precisely define which variables are collected, how
they are defined, or when data collection occurs, making it much harder to ensure data quality
and integrity. In many cases, data is not collected by the researcher at all and is instead
first pre-processed by the collector, introducing additional uncertainties and opportunities
for contamination. Moreover, data collection itself is now conditioned on who has specific
devices and services, introducing potential biases due to economic access or sorting. If policy
decisions are based solely on data derived from smartphones, the segment of the population
that cannot afford these devices may be underserved.

Surveys, on the other hand, provide the researcher far more control to target specific 
groups, ask precise questions, and collect rich covariates. Though the expense of creating, 
administering, and participating in surveys makes it difficult to collect data of the size and 
frequency of newer data sources, they can provide far more context about participants. 
This work demonstrates the benefits of both data gathering methods and shows that hybrid
models offer a way to leverage the advantages of each. Traditional survey-based forecasts
are improved here, not replaced, by mobile phone data. Moving forward, we hope to see
more such hybrid approaches. Projects such as the Future Mobility Survey and the MIT
Reality Mining project [23] bridge this gap by administering surveys via mobile devices,
allowing for the collection of process-generated data as well as survey-based data. These
projects open the possibility of directly measuring the correlation between data gathered by
each approach.

The macro-economy is the complex concatenation of interdependent decisions of millions
of individuals [69]. Having a measure of the activity of almost every individual in the
economy, of their movements and their connections, should transform our understanding of
the modern economy. Moreover, the ubiquity of such data allows us to test our theories at
scales large and small and all over the world with little added cost. We also note potential
privacy and ethical issues regarding the inference of employment/unemployment at the
individual level, with potentially dire consequences for individuals’ access, for example, to
financial markets. With the behavior of billions being volunteered, captured, and stored at
increasingly high resolutions, these data present an opportunity to shed light on some of the
biggest problems facing researchers and policy makers alike, but also represent an ethical
conundrum typical of the “big data” age.
SUPPLEMENTARY INFORMATION

Materials and Methods 

CDR Data set 1 (D1): We analyze call detail records (CDRs) from two industrialized
European countries. In the first country, we obtain data on 1.95 million users from a
service provider with roughly 15% market share. The data run for 15 months across
the years 2006 and 2007, with the exception of a gap between August and September
2006. Each call record includes de-identified caller and recipient IDs, the locations of
the caller’s and recipient’s cell towers, and the length of the call. Callers or recipients on
other network carriers are assigned random IDs. There are approximately 1.95 million
individuals identified in the data, 453 million calls, and 16 million hours of call time.
The median user makes or receives 90 calls per month.

CDR Data set 2 (D2): The second data set contains 10 million users (roughly 20% market
share) within a single country over three years of activity. Like D1, each billing
record for voice and text services contains the unique identifiers of the caller placing
the call and the callee receiving the call, an identifier for the cellular antenna (tower)
that handled the call, and the date and time when the call was placed. Coupled with
a data set describing the locations (latitude and longitude) of cellular towers, we have
the approximate location of the caller when placing the call. For this work we do not
distinguish between voice calls and text messages, and refer to either communication
type as a “call.” We also possess identification numbers for phones that are
outside the service provider but that make calls to or receive calls from users within the
company. While we do not possess any other information about these lines, nor anything
about their users or calls that are made to other numbers outside the service provider, we do
have records pertaining to all calls placed to or from those ID numbers involving
subscribers covered by our data set. Thus egocentric networks are complete only for users
within the company and their immediate neighbors. This information was used
to generate egocentric communication networks and to compute the features described
in the main text. From this data set, we generate a random sample population of k
users for each of the provinces, and track each user’s call history during our 27-month
tracking period (from December 2006 to March 2009). We discuss how the sample
size is chosen in a following subsection. Finally, we note that due to an error in data
extraction from the provider, we are missing data for Q4 in 2007.

The use of CDR data to study mobility and social behaviors on a massive scale is becoming
increasingly common. In addition to its large scale, its format is generally consistent between
countries and mobile operators. In the context of this study, each mobile phone data set
contains five columns: 1) an anonymized, unique identifier for the caller, 2) the ID of the
tower through which the caller’s call was routed, 3) an anonymized, unique identifier of the
receiver of the call, 4) the ID of the tower through which the receiver’s call was routed,
and 5) the timestamp, down to the second, at which the call was initiated. In order to obtain
the location of both caller and receiver, data is restricted to calls between members of
the same mobile operator. The tower IDs reflect the tower used upon starting the call and
we have no information on changes in location that may be made during the call. We also
obtain a list of latitudes and longitudes marking the coordinates of each tower. Although
calls are generally believed to be routed through the tower that is geographically closest to
the phone, this may not be the case if the signal is obscured by buildings or topography. For
this reason, we consider a cluster of towers near the geographic area in question instead of
a single tower.

To ensure privacy of mobile phone subscribers, all identifiers were anonymized before we 
received the data and no billing or demographic information was provided on individuals or 
in aggregate. 


Filtering CDR Data 

We limit our sample to mobile phone users who either make or receive at least ten calls 
connecting through one of the three cell towers closest to the manufacturing plant of interest. 
In addition, we require that users make at least one call in each month spanned by a given 
data set to ensure users are still active. 
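
The two filters above can be sketched in a few lines. This is a minimal illustration, not the code used in the study: the record layout `(caller, caller_tower, receiver, receiver_tower, month)`, the function name `filter_users`, and the choice to count only outgoing calls toward monthly activity are all assumptions made for the example.

```python
from collections import defaultdict

def filter_users(records, plant_towers, months, min_calls=10):
    """Return users with >= min_calls calls via plant_towers who call in every month.

    records: iterable of (caller, caller_tower, receiver, receiver_tower, month).
    """
    near_plant = defaultdict(int)     # calls made or received through a plant tower
    active_months = defaultdict(set)  # months in which the user placed a call
    for caller, caller_tower, receiver, receiver_tower, month in records:
        active_months[caller].add(month)
        if caller_tower in plant_towers:
            near_plant[caller] += 1
        if receiver_tower in plant_towers:
            near_plant[receiver] += 1
    required = set(months)
    return {
        user for user, count in near_plant.items()
        if count >= min_calls and required <= active_months[user]
    }
```

In this sketch `plant_towers` would hold the IDs of the three towers closest to the plant and `months` the months spanned by the data set.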


Manufacturing plant closure 

We gather information on a large manufacturing plant closing that affected a small
community within the service provider’s territory from news articles and administrative sources
collected by the country’s labor statistics bureau. The plant closure occurred in December
2006 and involved 1,100 employees at an auto-parts manufacturing plant in a town of
roughly 15,000 people.


Town Level Structural Break Model 

We model the pre-closure daily population of the town as consisting of three segments:
a fraction of non-resident plant workers $\gamma$, a fraction of resident workers $\mu$, and a
fraction of non-workers normalized to $(1 - \gamma - \mu)$. We postulate that each individual $i$
has a flow probability $p_i$ of making or receiving a call at every moment. Workers spend a
fraction $\psi$ of their day at their jobs and thus make, in expectation, $p_i \psi$ calls on a
given day during work hours. When losing their job in the town, both resident and non-resident
workers are re-matched in the national, not local, labor market.

Given this model, the expected daily number of cell phone subscribers making or receiving
calls from the three towers serving the plant and town is:

\[
\text{vol}_t =
\begin{cases}
\gamma \psi p + (1 - \gamma)\, p & \text{for } t < t_{\text{layoff}} \\
\mu (1 - \psi)\, p + (1 - \gamma - \mu)\, p & \text{for } t \geq t_{\text{layoff}}
\end{cases}
\]

This model predicts a discrete break in daily volume from the towers proximate to the
plant of $(\gamma + \mu)\psi p$ at the date $t_{\text{layoff}}$. For workers, the predicted
percentage change in call volume from these towers is negative, while non-workers experience
no change.
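
The size of the break can be checked by differencing the two regimes. The derivation below is a sketch that assumes the piecewise form $\gamma\psi p + (1-\gamma)p$ before the layoff and $\mu(1-\psi)p + (1-\gamma-\mu)p$ after it, with $\gamma$ the share of non-resident workers, $\mu$ the share of resident workers, $\psi$ the fraction of the day spent at work, and $p$ the call probability; the symbols are reconstructed from context.

```latex
\begin{align*}
\text{vol}_{t<t_{\text{layoff}}} - \text{vol}_{t\geq t_{\text{layoff}}}
  &= \bigl[\gamma\psi p + (1-\gamma)p\bigr]
     - \bigl[\mu(1-\psi)p + (1-\gamma-\mu)p\bigr] \\
  &= \gamma\psi p + p\bigl[(1-\gamma) - \mu(1-\psi) - (1-\gamma-\mu)\bigr] \\
  &= \gamma\psi p + \mu\psi p \\
  &= (\gamma+\mu)\,\psi\,p .
\end{align*}
```

Under these assumptions the drop scales with the total worker share $(\gamma + \mu)$, which is what allows the break to be read as an estimate of the size of the layoff.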
Individual Structural Break Model


We fit a model similar to the community structural break model to data for each individual
user, $i$, based on the probability that they made a call from the town on each day. For each
individual, we use the non-linear estimator to select a break date $t_{\text{layoff}}$ and constant
pre- and post-break daily probabilities $p_{i,\,t<t_{\text{layoff}}}$ and
$p_{i,\,t \geq t_{\text{layoff}}}$ that minimize the squared deviation
from each individual’s data. Figure 5 plots the distribution of break dates for individuals. As
expected, there is a statistically significant spike in the number of individuals experiencing
a break in the probability of making a call from the town at the time of the closure and
significantly fewer breaks on other, placebo dates. These two methods provide independent,
yet complementary ways of detecting mass layoffs in mobile phone data.
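
Both structural break fits amount to choosing the date that best splits a daily series into two constant levels. The following is a minimal sketch, assuming an exhaustive search over candidate dates in place of the non-linear estimator mentioned above; the function name `fit_break` and its return layout are illustrative, not the authors' implementation.

```python
def fit_break(series):
    """Least-squares single break in the mean of a daily series.

    Returns (break_index, pre_mean, post_mean, sse), where days
    [0, break_index) form the pre-break segment.
    """
    n = len(series)
    best = None
    for b in range(1, n):  # keep at least one day on each side of the break
        pre, post = series[:b], series[b:]
        m_pre = sum(pre) / len(pre)
        m_post = sum(post) / len(post)
        sse = (sum((x - m_pre) ** 2 for x in pre)
               + sum((x - m_post) ** 2 for x in post))
        if best is None or sse < best[3]:
            best = (b, m_pre, m_post, sse)
    return best
```

Applied to a user's 0/1 indicator of calling from the town on each day, the two means estimate the pre- and post-break daily probabilities; applied to the town's daily subscriber counts, the same routine gives a community-level break date.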



FIG. 5. We plot the distribution of break dates for the structural break model estimated for
individuals. We find a strong, statistically significant peak centered on the reported closure date
(red) with far fewer breaks on other, placebo dates. This is consistent with both our
community-wide model and the Bayesian model presented above.

Bayesian Estimation


On an individual level, Figure 6A shows days on which each user makes a call near
the plant, with users ranked from highest to lowest probability weight. Figure 6B provides
greater detail for users with probability weights between 50% and 100%. Users highly suspected
of being laid off demonstrate a sharp decline in the number of days they make calls near the
plant following the reported closure date. Figure 6C graphs the inverse cumulative distribution
of probability weights. While we do not have ground-truth evidence that any of these mobile
phone users was laid off, we find more support for our hypothesis by examining a two week
period roughly 125 days prior to the plant closure.

Figures 6A and 6B illustrate that the call patterns of users assigned the highest probabilities
change significantly after the plant closure. These users make calls from the town on a
consistent basis before the layoff, but make significantly fewer calls from the town afterwards.
In contrast, the call patterns of users assigned the lowest weights do not change following
the plant closure. In aggregate, we assign 143 users probability weights between 50% and
100%. This represents 13% of the pre-closure plant workforce (143 of 1,100 employees), a
fraction that compares closely with the roughly 15% national market share of the service provider.


The European Labor Force Survey 

Each quarter, many European countries are required to conduct labor force surveys to
measure important economic indicators like the unemployment rates studied here. In-person
or telephone interviews concerning employment status are conducted on a sample of less
than 0.5% of the population. Moreover, participants are only asked to provide responses
about their employment status during a 1 week period in the quarter.

These “microdata” surveys are then aggregated at the province and national levels.
Confirmed labor force reports and statistics for a particular quarter are released roughly 14 weeks
after the quarter has ended. For example, Q1 of 2012 begins January 1st, 2012 and ends
March 31st, 2012. The survey data is analyzed and preliminary unemployment numbers are
released between two and three weeks following the end of the quarter. These numbers, however,
are unconfirmed and subject to revisions which can occur at any time in the following quarters.
FIG. 6. Identifying affected individuals. A) Each user is represented by a row where we fill in a
day as colored if a call was made near the plant on that day. White space marks the absence of
calls. Rows are sorted by the assigned probability of that user being laid off. B) A closer view of
the users identified as most likely to have been laid off reveals a sharp cut off in days on which
calls were made from the plant. C) An inverse cumulative distribution of assigned probability
weights. The inset shows an enlarged view of the probability distribution for the 150 individuals
deemed most likely to have been laid off.

The Effect of Job Loss on Call Volumes 

We measure the effect of job loss on six properties of an individual’s social behavior and 
three mobility metrics. 


CDR Metrics 


FIG. 7. A timeline showing the various data collection and reporting periods. Traditional survey
methods conduct surveys over the course of a single week per quarter, asking participants about
their employment status during a single reference week. Unofficial survey results, subject to
revision, are then released a few weeks following the end of the quarter. Mobile phone data,
however, is continually collected throughout the quarter and is available for analysis at any time
during the period. Analysis of a given quarter can be performed and made available immediately
following the end of the month.


Calls: The total number of calls made and received by a user in a given month.

Incoming: The number of calls received by a user in a given month.

Outgoing: The number of calls made by a user in a given month.

Contacts: The number of unique individuals contacted by a user each month. Includes
calls made and received.

To Town: The fraction of a user’s calls made each month to another user who is physically
located in the town of the plant closure at the time the call was made.

Churn: The fraction of a user’s contacts called in the previous month that were not called
in the current month. Let $C_t$ be the set of users called in month $t$. Churn is then
calculated as: $\text{churn}_t = 1 - \frac{|C_{t-1} \cap C_t|}{|C_{t-1}|}$.

Towers: The number of unique towers visited by a user each month.

Radius of Gyration, $R_g$: The average displacement of a user from his or her center of
mass: $R_g = \sqrt{\frac{1}{n}\sum_{j=1}^{n} |\vec{r}_j - \vec{r}_{cm}|^2}$, where $n$ is the
number of calls made by a user in the month, $\vec{r}_j$ is the location of call $j$, and
$\vec{r}_{cm}$ is the center of mass calculated by averaging the positions of all
of a user’s calls that month. To guard against outliers such as long trips for vacation or
difficulty identifying important locations due to noise, we only consider months for
users where more than 5 calls were made and locations where a user recorded
more than three calls.

Average Distance from Top Tower, $R_1$: The average displacement of a user from their
most called location: $R_1 = \sqrt{\frac{1}{n}\sum_{j=1}^{n} |\vec{r}_j - \vec{r}_1|^2}$, where
$n$ is the number of calls made by a user in the month and $\vec{r}_1$ is the coordinates of the
location most visited by the user. To guard against outliers such as long trips for vacation or
difficulty identifying important locations due to noise, we only consider months for users
where more than 5 calls were made and locations where a user recorded more than three calls.
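
Three of the metrics above can be sketched as follows. The functions assume the churn definition as one minus the retained share of last month's contacts and the standard root-mean-square form for the two distance metrics; call locations are taken to be `(x, y)` tuples in a planar coordinate system, which is a simplifying assumption (tower latitudes and longitudes would first need projecting). All function names are illustrative.

```python
import math
from collections import Counter

def churn(prev_contacts, curr_contacts):
    """Fraction of last month's contacts that were not contacted this month."""
    prev_contacts, curr_contacts = set(prev_contacts), set(curr_contacts)
    if not prev_contacts:
        return None  # undefined when there were no contacts last month
    return 1 - len(prev_contacts & curr_contacts) / len(prev_contacts)

def rms_distance(points, origin):
    """Root-mean-square distance of the call locations from a reference point."""
    ox, oy = origin
    return math.sqrt(sum((x - ox) ** 2 + (y - oy) ** 2 for x, y in points) / len(points))

def radius_of_gyration(points):
    """RMS distance from the center of mass of the month's call locations."""
    n = len(points)
    center = (sum(x for x, _ in points) / n, sum(y for _, y in points) / n)
    return rms_distance(points, center)

def distance_from_top(points):
    """RMS distance from the most frequently used call location."""
    top = Counter(points).most_common(1)[0][0]
    return rms_distance(points, top)
```

The minimum-call thresholds described in the text would be applied before calling these functions.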

Measuring Changes 

For each user $i$, we compute these metrics monthly. Because individuals may have different
baseline behaviors, we normalize a user’s time series to the month immediately before
the layoff, denoted $t^*$. To assess differences in behavior as a result of the mass layoff, we
construct three groups: (1) a group of laid-off users from the town, where the probability of
being laid off is that calculated in the previous section, (2) a town control group consisting
of the same users as group 1, but with inverse weights, and (3) a group of users selected at
random from the country population. Each user in the final group is weighted equally.

For each month, we compute the weighted average of all metrics, then plot the difference
between the laid-off group and both control groups in Figure 3 in the main text:

\[ \bar{y}_t = \sum_i w_i \frac{y_{i,t}}{y_{i,t^*}} \tag{2} \]

\[ \Delta \bar{y}_t = \bar{y}_t - \bar{y}_{t,\text{control}} \tag{3} \]
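
The group comparison can be illustrated with a short sketch. It assumes that each user carries a layoff probability weight, that the weighted average is normalized by the sum of weights, and that the "inverse weights" of the town control group are one minus the layoff weight; these choices, and the function names, are assumptions made for the example.

```python
def weighted_relative_mean(users, t, t_star):
    """Weighted mean of y[t] / y[t_star]; users is a list of (weight, {month: value})."""
    total_weight = sum(w for w, _ in users)
    return sum(w * y[t] / y[t_star] for w, y in users) / total_weight

def group_difference(users, t, t_star):
    """Laid-off group (weights w) minus town control (weights 1 - w) at month t."""
    control = [(1.0 - w, y) for w, y in users]
    return (weighted_relative_mean(users, t, t_star)
            - weighted_relative_mean(control, t, t_star))
```

A negative value at month `t` indicates that users weighted toward having been laid off fell further below their pre-layoff baseline than the town control did.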


We estimate changes in monthly behavior using OLS regressions. We specify two models
that provide similar results. For a metric $y$:

\[ y_i = \alpha_i + \beta_1 A_i + \beta_2 U_i + \beta_3 A_i U_i \tag{4} \]

where $A_i$ is a dummy variable indicating whether the observation was made in a month before
or after the plant closure and $U_i$ is a dummy variable that is 1 if the user was assigned a
greater than 50% probability of having been laid off and 0 otherwise. An alternate model
substitutes the probability of layoff itself, $w_i$, for the unemployed dummy:

\[ y_i = \alpha_i + \beta_1 A_i + \beta_2 w_i + \beta_3 A_i w_i \tag{5} \]

In many cases, we are more interested in relative changes in behavior rather than absolute
levels. For this, we specify a log-level model of the form:

\[ \log(y_i) = \alpha_i + \beta_1 A_i + \beta_2 w_i + \beta_3 A_i w_i \tag{6} \]

Now the coefficient $\beta_3$ can be interpreted as the percentage change in feature $y_i$
experienced by a laid-off individual in months following the plant closure. Changes to mobility
metrics as well as changes to total, incoming, and outgoing calls were estimated using the
log-level model. Churn and To Town metrics are percentages already and are thus estimated using
a level-level model. Changes in the number of contacts are also estimated using a
level-level model.

Models are estimated using data from users believed to be unemployed and data from 
the two control groups. Results are shown in Table ??. Comparisons to each group produce 
consistent results. 
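
The interaction regressions can be illustrated with a small ordinary least squares routine. This is a sketch under simplifying assumptions: it uses a single shared intercept rather than user-specific intercepts, solves the normal equations directly, and the names `ols` and `did_design` are hypothetical.

```python
def ols(X, y):
    """Solve the normal equations (X'X) b = X'y by Gauss-Jordan elimination."""
    k = len(X[0])
    # Augmented matrix [X'X | X'y].
    A = [[sum(row[i] * row[j] for row in X) for j in range(k)]
         + [sum(row[i] * yi for row, yi in zip(X, y))] for i in range(k)]
    for col in range(k):
        # Partial pivoting: move the largest remaining entry onto the diagonal.
        pivot = max(range(col, k), key=lambda r: abs(A[r][col]))
        A[col], A[pivot] = A[pivot], A[col]
        for r in range(k):
            if r != col:
                f = A[r][col] / A[col][col]
                A[r] = [a - f * b for a, b in zip(A[r], A[col])]
    return [A[i][k] / A[i][i] for i in range(k)]

def did_design(after, weight):
    """Regressors for y = a + b1*A + b2*w + b3*A*w (intercept, A, w, A*w)."""
    return [1.0, after, weight, after * weight]
```

The last coefficient returned is the interaction term; for the log-level variant, the outcome passed as `y` would be the logarithm of the metric.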

In addition to relative changes, we also measure percent changes in each variable pre- and
post-layoff. Figure 8 shows the percent change in each variable for an average month before
and after the plant closing. We report changes for three groups: those we believe were
unaffected by the layoff but live in the town, those we believe were unaffected but live
elsewhere in the country, and those with a probability p > 0.5 of being laid off. The laid-off
group shows significant changes in all metrics, while the town and country control groups
show few.

Predicting Province Level Unemployment 

To evaluate the predictive power of micro-level behavioral changes, we use data from a
different undisclosed industrialized European country. As discussed in the main text, we
use call detail records spanning nearly 3 years and the entire user base of a major mobile
phone provider in the country. For each of the roughly 50 provinces within this country,
we assemble quarterly unemployment rates during the period covered by the CDR data.
At the national level, we collect a time series of GDP. We select a sample of users in each
province and measure the average relative value of 7 of the variables identified to change
following a layoff. The To Town and distance from home variables are omitted, as the former is
only measured when we know the location of the layoff and the latter is strongly correlated
with $R_g$.

FIG. 8. For each sample of mobile phone users (those we believe to be unaffected by the layoff
living in the town and the country as well as those with a probability p > 0.5 of being laid off), we
plot the percent change in each variable before and after the layoff.


First, we correlate each aggregate calling variable with unemployment at the regional
level. To control for differences in base levels of unemployment across the country, we first
de-mean unemployment and each aggregate variable. Table ?? shows that each calling behavior
is significantly correlated with unemployment and that these correlations are consistent
with the directions found in the individual section of the paper. Moreover, we discover
strong correlation between each of the calling behavior variables, suggesting that principal
component analysis is appropriate.

Principal Component Analysis


As shown in the individual section of the paper, changes in these variables following a mass
layoff are correlated. This correlation is seen in province level changes as well (Table ??).
Given this correlation, we use principal component analysis (PCA) to extract an independent
mobile phone variable and guard against collinearity when including all phone variables as
regressors. The results from PCA and the loadings in each component can be found in Table
?? and Table ??, respectively. We find that only the first principal component passes the Kaiser
test with an eigenvalue significantly greater than 1, but that this component captures 59%
of the variance in the calling data. The loadings in this component fall strongly on the social
behavior variables. We then compute the scores for this component for each observation in
the data and use these scores as regressors.
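
Extracting the first principal component and checking it against the Kaiser criterion can be sketched without external libraries. The example assumes PCA on the correlation matrix of standardized variables and uses power iteration, which only recovers the leading component and presumes the starting vector is not orthogonal to it; function names are illustrative and this is not the authors' code.

```python
import math

def correlation_matrix(rows):
    """Return (correlation matrix, standardized data) for a list of observations."""
    n, k = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(k)]
    sds = [math.sqrt(sum((r[j] - means[j]) ** 2 for r in rows) / (n - 1))
           for j in range(k)]
    z = [[(r[j] - means[j]) / sds[j] for j in range(k)] for r in rows]
    corr = [[sum(zr[i] * zr[j] for zr in z) / (n - 1) for j in range(k)]
            for i in range(k)]
    return corr, z

def first_component(corr, iters=500):
    """Leading eigenvalue and unit eigenvector of corr via power iteration."""
    k = len(corr)
    v = [1.0 / math.sqrt(k)] * k
    for _ in range(iters):
        w = [sum(corr[i][j] * v[j] for j in range(k)) for i in range(k)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    eigenvalue = sum(v[i] * sum(corr[i][j] * v[j] for j in range(k))
                     for i in range(k))
    return eigenvalue, v

def scores(z, v):
    """Project each standardized observation onto the component."""
    return [sum(zi * vi for zi, vi in zip(row, v)) for row in z]
```

With standardized variables the eigenvalues sum to the number of variables, so `eigenvalue / len(corr)` gives the share of variance captured, and an eigenvalue above 1 is the Kaiser threshold referred to in the text.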


Model Specification 

We make predictions of present and future unemployment rates using three different
model specifications of unemployment. Each specification is run in two variants, one
with the principal component score as an additional independent variable, denoted $CDR_t$,
and the other without, and each variant is fit for both the current quarter ($U_t$) and the
following quarter ($U_{t+1}$). The twelve models are described as follows:


1. AR(1)

\[ U_t = \alpha_1 U_{t-1} \tag{7} \]

\[ U_t = \beta_1 U_{t-1} + \gamma\, CDR_t \tag{8} \]

\[ U_{t+1} = \alpha_1 U_{t-1} \tag{9} \]

\[ U_{t+1} = \beta_1 U_{t-1} + \gamma\, CDR_t \tag{10} \]

2. AR(1) + Quad

\[ U_t = \alpha_1 U_{t-1} + \alpha_2 U_{t-1}^2 \tag{11} \]

\[ U_t = \beta_1 U_{t-1} + \beta_2 U_{t-1}^2 + \gamma\, CDR_t \tag{12} \]

\[ U_{t+1} = \alpha_1 U_{t-1} + \alpha_2 U_{t-1}^2 \tag{13} \]

\[ U_{t+1} = \beta_1 U_{t-1} + \beta_2 U_{t-1}^2 + \gamma\, CDR_t \tag{14} \]


3. AR(1) + GDP

\[ U_t = \alpha_1 U_{t-1} + \alpha_2\, gdp_{t-1} \tag{15} \]

\[ U_t = \beta_1 U_{t-1} + \beta_2\, gdp_{t-1} + \gamma\, CDR_t \tag{16} \]

\[ U_{t+1} = \alpha_1 U_{t-1} + \alpha_2\, gdp_{t-1} \tag{17} \]

\[ U_{t+1} = \beta_1 U_{t-1} + \beta_2\, gdp_{t-1} + \gamma\, CDR_t \tag{18} \]

To evaluate the ability of these models to predict unemployment, we use a cross-validation
framework. Data from half of the provinces are used to train the model and these coefficients
are used to predict unemployment rates given data for the other half of the provinces. We
perform the same procedure switching the training and testing set and combine the
out-of-sample predictions for each case. We evaluate the overall utility of these models by
plotting predictions versus observations, finding strong correlation (see the main text). To
evaluate the additional benefit gained from the inclusion of phone data, we compute the
percentage difference between the same model specification with and without the mobile phone
data, $\Delta \text{RMSE}\% = 1 - \text{RMSE}_{\text{w/CDR}} / \text{RMSE}_{\text{w/out}}$. In each
case, we find that the addition of mobile phone data reduces the RMSE by 5% to 20%.
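
The two-fold evaluation can be sketched for the simplest specification, an AR(1) model with and without the CDR score. The data layout (a dictionary mapping each province to rows of lagged unemployment, CDR score, and current unemployment), the deterministic split of provinces by sorted name, and the function names are assumptions made for illustration; the regressions have no intercept, matching the form of the AR(1) equations.

```python
import math

def fit_no_intercept(X, y):
    """Least squares without intercept for one or two regressors (closed form)."""
    if len(X[0]) == 1:
        return [sum(r[0] * t for r, t in zip(X, y)) / sum(r[0] ** 2 for r in X)]
    s11 = sum(r[0] * r[0] for r in X)
    s12 = sum(r[0] * r[1] for r in X)
    s22 = sum(r[1] * r[1] for r in X)
    b1 = sum(r[0] * t for r, t in zip(X, y))
    b2 = sum(r[1] * t for r, t in zip(X, y))
    det = s11 * s22 - s12 * s12
    return [(b1 * s22 - b2 * s12) / det, (s11 * b2 - s12 * b1) / det]

def rmse(pred, obs):
    return math.sqrt(sum((p - o) ** 2 for p, o in zip(pred, obs)) / len(obs))

def two_fold_rmse(data, columns):
    """Train on one half of the provinces, predict the other, then swap."""
    provinces = sorted(data)
    half = len(provinces) // 2
    folds = [provinces[:half], provinces[half:]]
    preds, obs = [], []
    for train, test in (folds, folds[::-1]):
        X = [[row[c] for c in columns] for p in train for row in data[p]]
        y = [row[2] for p in train for row in data[p]]
        coef = fit_no_intercept(X, y)
        for p in test:
            for row in data[p]:
                preds.append(sum(b * row[c] for b, c in zip(coef, columns)))
                obs.append(row[2])
    return rmse(preds, obs)

def rmse_improvement(data):
    """1 - RMSE(with CDR) / RMSE(without CDR) on pooled out-of-sample predictions."""
    return 1 - two_fold_rmse(data, [0, 1]) / two_fold_rmse(data, [0])
```

A positive return value from `rmse_improvement` corresponds to the phone covariate reducing out-of-sample error relative to the baseline.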


Predictions using weekly CDR Data 

Until now, we have used data from the entire quarter to predict the results from the
unemployment survey conducted in the same quarter. While these predictions would be
available at the very end of the quarter, weeks before the survey data is released, we also
make predictions using CDR data from half of each quarter to provide an additional 1.5
months lead time that may increase the utility of these predictions. We estimate the same
models as described in the previous section and find similar results. Even without full access
to a quarter’s CDR data, we can improve predictions of that quarter’s unemployment survey
before the quarter is over by 3%-6%.


The Effect of Sample Size on Feature Estimation 

It is important to consider whether the sample size is sufficient and does not substantially
affect the feature estimation. We study the reliability of sample size ($k$) in terms of
relative standard deviation (RSD). For each given sample size $k$, we sample $T$ times (without
replacement) from the population. The RSD with respect to sample size $k$ for a particular
feature is given by $RSD(k) = s_k / \mu_k$, where $s_k$ is the standard deviation of the feature
estimates from the $T$ samples and $\mu_k$ is the mean of the feature estimates from the $T$
samples. We use $T = 10$ to study the feature reliability. In Figure ??, we plot the different
features’ %RSD by averaging the RSD values of all provinces. The plots show that the values of
%RSD decrease rapidly over sample sizes $k = 100, 200, \ldots, 2000$. When sample size
$k = 2000$, the %RSD for all features, except for radius of gyration ($R_g$), is lower than 1%.
The estimates of $R_g$ exhibit the highest variation; however, we can still obtain reliable
estimates with thousands of sampled individuals ($RSD(k) = 0.026$ for $R_g$, with $k = 2000$).
For the results in the manuscript, a value of $k = 3000$ was chosen with the confidence that
sample size effects are small.
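
The reliability check can be sketched as below. It assumes that the feature estimate for a sample is the mean of a per-user feature value and that the standard deviation across the T estimates is the sample standard deviation; the function name `rsd` and the fixed seed are illustrative choices.

```python
import random
import statistics

def rsd(population, k, T=10, seed=0):
    """Relative standard deviation of the sample mean over T samples of size k."""
    rng = random.Random(seed)
    # Each sample is drawn without replacement from the full population.
    estimates = [sum(rng.sample(population, k)) / k for _ in range(T)]
    return statistics.stdev(estimates) / statistics.mean(estimates)
```

Evaluating `rsd` over a grid of sample sizes for each feature and averaging across provinces would give curves of the kind described in the text.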


Mass Layoffs and General Unemployment 

While mass layoffs provide a convenient and interesting natural experiment to deploy our
methods, they are only one of many employment shocks that the economy absorbs each month.
We have measured changes in call behaviors due to mass layoffs, but these changes may
be unhelpful if they do not result from other forms of unemployment like isolated layoffs
of individual workers. Though it is beyond the scope of this work to directly determine if
individuals affected by mass layoffs experience the same behavioral changes as those
experiencing unemployment due to other reasons, strong correlations have been observed between
the number of mass layoffs observed in a given time period and general unemployment rates
(see http://www.bls.gov/news.release/pdf/mmls.pdf).